Overview

Dataset statistics

Number of variables: 11
Number of observations: 699
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 8
Duplicate rows (%): 1.1%
Total size in memory: 60.2 KiB
Average record size in memory: 88.2 B

Variable types

NUM: 9
CAT: 2

Reproduction

Analysis started: 2020-06-04 12:03:57.845005
Analysis finished: 2020-06-04 12:04:34.688440
Duration: 36.84 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Dataset has 8 (1.1%) duplicate rows (Duplicates)
UnifShape is highly correlated with UnifSize (High correlation)
UnifSize is highly correlated with UnifShape (High correlation)

Variables

ID
Real number (ℝ≥0)

Distinct count: 645
Unique (%): 92.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1071704.0987124464
Minimum: 61634
Maximum: 13454352
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.5 KiB

Quantile statistics

Minimum: 61634
5-th percentile: 411453
Q1: 870688.5
median: 1171710
Q3: 1238298
95-th percentile: 1333890.8
Maximum: 13454352
Range: 13392718
Interquartile range (IQR): 367609.5

Descriptive statistics

Standard deviation: 617095.7298
Coefficient of variation (CV): 0.5758079404
Kurtosis: 257.7171591
Mean: 1071704.099
Median Absolute Deviation (MAD): 104381
Skewness: 13.67532594
Sum: 749121165
Variance: 3.808071398e+11

Histogram with fixed size bins (bins=10)

Common values

Value               Count  Frequency (%)
1182404             6      0.9%
1276091             5      0.7%
1198641             3      0.4%
466906              2      0.3%
1116116             2      0.3%
1070935             2      0.3%
385103              2      0.3%
1293439             2      0.3%
1240603             2      0.3%
1277792             2      0.3%
Other values (635)  671    96.0%

Minimum 5 values

Value               Count  Frequency (%)
61634               1      0.1%
63375               1      0.1%
76389               1      0.1%
95719               1      0.1%
128059              1      0.1%

Maximum 5 values

Value               Count  Frequency (%)
13454352            1      0.1%
8233704             1      0.1%
1371920             1      0.1%
1371026             1      0.1%
1369821             1      0.1%

Clump
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 4.417739628040057
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 4
Q3: 6
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.815740659
Coefficient of variation (CV): 0.6373713473
Kurtosis: -0.6237154123
Mean: 4.417739628
Median Absolute Deviation (MAD): 2
Skewness: 0.5928585327
Sum: 3088
Variance: 7.928395456

Histogram with fixed size bins (bins=10)

Common values

Value  Count  Frequency (%)
1      145    20.7%
5      130    18.6%
3      108    15.5%
4      80     11.4%
10     69     9.9%
2      50     7.2%
8      46     6.6%
6      34     4.9%
7      23     3.3%
9      14     2.0%

Minimum 5 values

Value  Count  Frequency (%)
1      145    20.7%
2      50     7.2%
3      108    15.5%
4      80     11.4%
5      130    18.6%

Maximum 5 values

Value  Count  Frequency (%)
10     69     9.9%
9      14     2.0%
8      46     6.6%
7      23     3.3%
6      34     4.9%

UnifSize
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.13447782546495
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 5
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 3.05145911
Coefficient of variation (CV): 0.9735143395
Kurtosis: 0.09880288537
Mean: 3.134477825
Median Absolute Deviation (MAD): 0
Skewness: 1.233136558
Sum: 2191
Variance: 9.3114027

Histogram with fixed size bins (bins=10)

Common values

Value  Count  Frequency (%)
1      384    54.9%
10     67     9.6%
3      52     7.4%
2      45     6.4%
4      40     5.7%
5      30     4.3%
8      29     4.1%
6      27     3.9%
7      19     2.7%
9      6      0.9%

Minimum 5 values

Value  Count  Frequency (%)
1      384    54.9%
2      45     6.4%
3      52     7.4%
4      40     5.7%
5      30     4.3%

Maximum 5 values

Value  Count  Frequency (%)
10     67     9.6%
9      6      0.9%
8      29     4.1%
7      19     2.7%
6      27     3.9%

UnifShape
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.207439198855508
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 5
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.971912767
Coefficient of variation (CV): 0.9265686995
Kurtosis: 0.007010980047
Mean: 3.207439199
Median Absolute Deviation (MAD): 0
Skewness: 1.161859179
Sum: 2242
Variance: 8.832265496

Histogram with fixed size bins (bins=10)

Common values

Value  Count  Frequency (%)
1      353    50.5%
2      59     8.4%
10     58     8.3%
3      56     8.0%
4      44     6.3%
5      34     4.9%
7      30     4.3%
6      30     4.3%
8      28     4.0%
9      7      1.0%

Minimum 5 values

Value  Count  Frequency (%)
1      353    50.5%
2      59     8.4%
3      56     8.0%
4      44     6.3%
5      34     4.9%

Maximum 5 values

Value  Count  Frequency (%)
10     58     8.3%
9      7      1.0%
8      28     4.0%
7      30     4.3%
6      30     4.3%

MargAdh
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.8068669527896994
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 4
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.855379239
Coefficient of variation (CV): 1.017283429
Kurtosis: 0.9879470695
Mean: 2.806866953
Median Absolute Deviation (MAD): 0
Skewness: 1.524468091
Sum: 1962
Variance: 8.1531906

Histogram with fixed size bins (bins=10)

Common values

Value  Count  Frequency (%)
1      407    58.2%
3      58     8.3%
2      58     8.3%
10     55     7.9%
4      33     4.7%
8      25     3.6%
5      23     3.3%
6      22     3.1%
7      13     1.9%
9      5      0.7%

Minimum 5 values

Value  Count  Frequency (%)
1      407    58.2%
2      58     8.3%
3      58     8.3%
4      33     4.7%
5      23     3.3%

Maximum 5 values

Value  Count  Frequency (%)
10     55     7.9%
9      5      0.7%
8      25     3.6%
7      13     1.9%
6      22     3.1%

SingEpiSize
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.216022889842632
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 2
Q3: 4
95-th percentile: 8
Maximum: 10
Range: 9
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 2.214299887
Coefficient of variation (CV): 0.6885211836
Kurtosis: 2.169066423
Mean: 3.21602289
Median Absolute Deviation (MAD): 0
Skewness: 1.712171802
Sum: 2248
Variance: 4.903123988

Histogram with fixed size bins (bins=10)

Common values

Value  Count  Frequency (%)
2      386    55.2%
3      72     10.3%
4      48     6.9%
1      47     6.7%
6      41     5.9%
5      39     5.6%
10     31     4.4%
8      21     3.0%
7      12     1.7%
9      2      0.3%

Minimum 5 values

Value  Count  Frequency (%)
1      47     6.7%
2      386    55.2%
3      72     10.3%
4      48     6.9%
5      39     5.6%

Maximum 5 values

Value  Count  Frequency (%)
10     31     4.4%
9      2      0.3%
8      21     3.0%
7      12     1.7%
6      41     5.9%

BareNuc
Categorical

Distinct count: 11
Unique (%): 1.6%
Missing: 0
Missing (%): 0.0%
Memory size: 5.5 KiB

Bar chart of category counts: 1 (402), 10 (132), 5 (30), 2 (30), 3 (28), Other values (6) (77)

Common values

Value  Count  Frequency (%)
1      402    57.5%
10     132    18.9%
5      30     4.3%
2      30     4.3%
3      28     4.0%
8      21     3.0%
4      19     2.7%
?      16     2.3%
9      9      1.3%
7      8      1.1%

Length

Max length: 2
Median length: 1
Mean length: 1.188841202
Min length: 1

BlandChrom
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.4377682403433476
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 3
Q3: 5
95-th percentile: 8
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.438364252
Coefficient of variation (CV): 0.7092869798
Kurtosis: 0.1846213115
Mean: 3.43776824
Median Absolute Deviation (MAD): 1
Skewness: 1.099969082
Sum: 2403
Variance: 5.945620227

Histogram with fixed size bins (bins=10)

Common values

Value  Count  Frequency (%)
2      166    23.7%
3      165    23.6%
1      152    21.7%
7      73     10.4%
4      40     5.7%
5      34     4.9%
8      28     4.0%
10     20     2.9%
9      11     1.6%
6      10     1.4%

Minimum 5 values

Value  Count  Frequency (%)
1      152    21.7%
2      166    23.7%
3      165    23.6%
4      40     5.7%
5      34     4.9%

Maximum 5 values

Value  Count  Frequency (%)
10     20     2.9%
9      11     1.6%
8      28     4.0%
7      73     10.4%
6      10     1.4%

NormNucl
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.866952789699571
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 4
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 3.053633894
Coefficient of variation (CV): 1.065114816
Kurtosis: 0.4742686755
Mean: 2.86695279
Median Absolute Deviation (MAD): 0
Skewness: 1.422261257
Sum: 2004
Variance: 9.324679956

Histogram with fixed size bins (bins=10)

Common values

Value  Count  Frequency (%)
1      443    63.4%
10     61     8.7%
3      44     6.3%
2      36     5.2%
8      24     3.4%
6      22     3.1%
5      19     2.7%
4      18     2.6%
9      16     2.3%
7      16     2.3%

Minimum 5 values

Value  Count  Frequency (%)
1      443    63.4%
2      36     5.2%
3      44     6.3%
4      18     2.6%
5      19     2.7%

Maximum 5 values

Value  Count  Frequency (%)
10     61     8.7%
9      16     2.3%
8      24     3.4%
7      16     2.3%
6      22     3.1%

Mit
Real number (ℝ≥0)

Distinct count: 9
Unique (%): 1.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.5894134477825466
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.5 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 1
95-th percentile: 5
Maximum: 10
Range: 9
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 1.715077943
Coefficient of variation (CV): 1.07906344
Kurtosis: 12.65787807
Mean: 1.589413448
Median Absolute Deviation (MAD): 0
Skewness: 3.560657844
Sum: 1111
Variance: 2.941492349

Histogram with fixed size bins (bins=10)

Common values

Value  Count  Frequency (%)
1      579    82.8%
2      35     5.0%
3      33     4.7%
10     14     2.0%
4      12     1.7%
7      9      1.3%
8      8      1.1%
5      6      0.9%
6      3      0.4%

Minimum 5 values

Value  Count  Frequency (%)
1      579    82.8%
2      35     5.0%
3      33     4.7%
4      12     1.7%
5      6      0.9%

Maximum 5 values

Value  Count  Frequency (%)
10     14     2.0%
8      8      1.1%
7      9      1.3%
6      3      0.4%
5      6      0.9%

Class
Categorical

Distinct count: 2
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 5.5 KiB

Bar chart of category counts: 2 (458), 4 (241)

Common values

Value  Count  Frequency (%)
2      458    65.5%
4      241    34.5%

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) measures the linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, so for a perfectly linear relationship the angle of the line to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
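As a minimal sketch of this calculation, the function below divides the covariance by the product of the standard deviations using only the Python standard library. The name `pearson_r` is illustrative and is not part of pandas-profiling; the example values are the UnifSize and UnifShape entries of the ten rows listed under "First rows", so the result describes those ten rows only, not the full dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the two variances cancel,
    # so plain sums of products and squares are enough.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# UnifSize and UnifShape from the ten "First rows" of the sample.
unif_size = [1, 4, 1, 8, 1, 10, 1, 1, 1, 2]
unif_shape = [1, 4, 1, 8, 1, 10, 1, 2, 1, 1]
print(round(pearson_r(unif_size, unif_shape), 2))  # 0.99 for these ten rows

# Shifting and positively rescaling one variable leaves r unchanged.
rescaled = [2 * v + 3 for v in unif_size]
print(round(pearson_r(rescaled, unif_shape), 2))  # 0.99 again
```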

Spearman's ρ

Spearman's rank correlation coefficient (ρ) measures the monotonic correlation between two variables and is therefore better than Pearson's r at catching nonlinear monotonic relationships. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.

Kendall's τ

Like Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one counts the concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
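The pair-counting formula can be sketched directly. The function name `kendall_tau_a` is illustrative; it implements the plain formula described here (often called tau-a), in which tied pairs count as neither concordant nor discordant. Statistical libraries often report the tau-b variant instead, which adjusts the denominator for ties, so their values may differ from this sketch on data with many repeated values.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1  # both variables move in the same direction
        elif direction < 0:
            discordant += 1  # the variables move in opposite directions
        # direction == 0 is a tie in x or y and counts toward neither
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Three observations: two concordant pairs, one discordant pair.
print(kendall_tau_a([1, 2, 3], [1, 3, 2]))  # (2 - 1) / 3
```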

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient for a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. This report uses the bias-corrected measure proposed by Bergsma in 2013.
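A minimal sketch of a bias-corrected Cramér's V follows, using the commonly cited form of Bergsma's correction: the chi-squared statistic is divided by n, the expected bias (k-1)(r-1)/(n-1) is subtracted, and the numbers of rows and columns are shrunk accordingly. The function name `cramers_v_corrected` and the toy data are illustrative assumptions; this is not the pandas-profiling source code.

```python
from collections import Counter
from math import sqrt


def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V for two sequences of category labels."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a, row_count in row_totals.items():
        for b, col_count in col_totals.items():
            expected = row_count * col_count / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    # Subtract the expected bias and shrink the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    if denom <= 0:
        return 0.0  # a constant variable carries no association
    return sqrt(phi2_corr / denom)


# Toy data: y is fully determined by x, then y is independent of x.
x = ["a", "b"] * 50
print(cramers_v_corrected(x, ["u", "v"] * 50))            # perfect association
print(cramers_v_corrected(["a", "a", "b", "b"] * 25,
                          ["u", "v", "u", "v"] * 25))    # independence
```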

Missing values

Sample

First rows

     ID       Clump  UnifSize  UnifShape  MargAdh  SingEpiSize  BareNuc  BlandChrom  NormNucl  Mit  Class
0    1000025  5      1         1          1        2            1        3           1         1    2
1    1002945  5      4         4          5        7            10       3           2         1    2
2    1015425  3      1         1          1        2            2        3           1         1    2
3    1016277  6      8         8          1        3            4        3           7         1    2
4    1017023  4      1         1          3        2            1        3           1         1    2
5    1017122  8      10        10         8        7            10       9           7         1    4
6    1018099  1      1         1          1        2            10       3           1         1    2
7    1018561  2      1         2          1        2            1        3           1         1    2
8    1033078  2      1         1          1        2            1        1           1         5    2
9    1033078  4      2         1          1        2            1        2           1         1    2

Last rows

     ID       Clump  UnifSize  UnifShape  MargAdh  SingEpiSize  BareNuc  BlandChrom  NormNucl  Mit  Class
689  654546   1      1         1          1        2            1        1           1         8    2
690  654546   1      1         1          3        2            1        1           1         1    2
691  695091   5      10        10         5        4            5        4           4         1    4
692  714039   3      1         1          1        2            1        1           1         1    2
693  763235   3      1         1          1        2            1        2           1         2    2
694  776715   3      1         1          1        3            2        1           1         1    2
695  841769   2      1         1          1        2            1        1           1         1    2
696  888820   5      10        10         3        7            3        8           10        2    4
697  897471   4      8         6          4        3            4        10          6         1    4
698  897471   4      8         8          5        4            5        10          4         1    4

Duplicate rows

Most frequent

     ID       Clump  UnifSize  UnifShape  MargAdh  SingEpiSize  BareNuc  BlandChrom  NormNucl  Mit  Class  count
0    320675   3      3         5          2        3            10       7           1         1    4      2
1    466906   1      1         1          1        2            1        1           1         1    2      2
2    704097   1      1         1          1        1            1        2           1         1    2      2
3    1100524  6      10        10         2        8            10       7           3         3    4      2
4    1116116  9      10        10         1        10           8        3           3         1    4      2
5    1198641  3      1         1          1        2            1        3           1         1    2      2
6    1218860  1      1         1          1        1            1        3           1         1    2      2
7    1321942  5      1         1          1        2            1        3           1         1    2      2